DOCUMENT RESUME 



ED 402 571 



CS 012 689 



TITLE Quality and Utility: The 1994 Trial State Assessment
in Reading. The Fourth Report of the National Academy
of Education Panel on the Evaluation of the NAEP
Trial State Assessment: 1994 Trial State Assessment
in Reading.
INSTITUTION National Academy of Education, Stanford, Calif.
SPONS AGENCY Department of Education, Washington, DC.
REPORT NO ISBN-0-942469-09-7
PUB DATE 96
NOTE 221p.
PUB TYPE Reports - Research/Technical (143)



EDRS PRICE MF01/PC09 Plus Postage. 

DESCRIPTORS *Data Analysis; Elementary Secondary Education;
Limited English Speaking; Program Design; *Reading
Achievement; *Reading Research; *Test Use; *Test
Validity

IDENTIFIERS *National Assessment of Educational Progress; *Trial
State Assessment (NAEP)



ABSTRACT 

This report evaluates the conduct, validity, and uses 
of the National Assessment of Educational Progress (NAEP) Trial State 
Assessment (TSA). The report addresses such pressing problems as how
participation in NAEP can be maintained and appropriate samples can 
be achieved; how errors can be minimized in the complex process of 
scaling and analyzing data; how the definition of achievement levels 
can be accomplished; how children with limited English
proficiency or disabilities can be included and reported; how private
schools can be included and reported; and how the NAEP state 
assessments relate to the national NAEP. After an introduction, 
sections of the report are The Content Validity of the 1994 Reading 
Assessment; Sampling and Assessment Administration for the 1994 TSA; 
The Assessment of Students with Disabilities or Limited English
Proficiency; Scaling and Analysis of the 1994 Reading
Assessment; Reading Achievement Levels; Reporting and Dissemination 
for the 1994 Reading Assessment; and Conclusions and Recommendations. 
Contains 66 references, and 21 tables and 7 figures of data. 
Appendixes present detailed scoring guides and examples of student 
responses for sample assessment items shown in figure 2.1; reading
experts participating in the panel's content validity study for the 
1994 TSA; and synopses of studies for the National Academy of 
Education Panel on the Evaluation of the National Assessment of 
Educational Progress Trial State Assessment. (RS) 



***********************************************************************

* Reproductions supplied by EDRS are the best that can be made

* from the original document.

***********************************************************************






Quality and Utility: 



The 1994 Trial State Assessment in Reading



The Fourth Report of 

The National Academy of Education Panel 

on the Evaluation of the NAEP Trial State Assessment: 

1994 Trial State Assessment in Reading 



Panel Chairmen 

Robert Glaser, University of Pittsburgh 
Robert Linn, University of Colorado 



U.S. DEPARTMENT OF EDUCATION

Office of Educational Research and Improvement

EDUCATIONAL RESOURCES INFORMATION
CENTER (ERIC)

This document has been reproduced as
received from the person or organization
originating it.

□ Minor changes have been made to improve
reproduction quality.

• Points of view or opinions stated in this document
do not necessarily represent official
OERI position or policy.



Project Director 

George Bohrnstedt, American Institutes for Research 





The contents of this book were developed 
under a grant from the Department of Education. 
However, those contents do not 
necessarily represent the policy of the 
Department of Education and you 
should not assume endorsement by the Federal Government. 

Copyright © 1996 by The National Academy of Education 
ISBN 0-942469-09-7 
All rights reserved. 

Printed on recycled paper in the United States of America 
The National Academy of Education 
Stanford University 
School of Education CERAS-108 
Stanford, CA 94305-3084 
(415) 725-1003 

Designed and printed by Armadillo Press, Mountain View, CA 


The National Academy of Education 



The National Academy of Education is composed of scholars and education leaders 
who “promote scholarly inquiry and discussion concerning the ends and means of 
education, in all its forms, in the United States and abroad.” Our current active 
membership is limited to 125 scholars. The heart of the Academy is found in the 
lively discussions that take place in our regular meetings and in the special panels and 
committees that we establish. Throughout our 31-year history the Academy has been 
called upon by governmental and other agencies to conduct special studies and 
reviews on education issues of public interest, ranging from desegregation to the 
teaching of reading to standards-based reform. During the past six years, a panel of 
the Academy has been monitoring, studying, and making recommendations 
concerning the conduct of the Trial State Assessments given since 1990 in conjunction 
with the National Assessment of Educational Progress. For the first time in the history 
of NAEP, these assessments allow state-by-state comparisons of education 
achievement; they have proven to be of great interest to educators. The Academy’s 
Panel has prepared three in-depth biennial reports on the trial state assessments of 
1990, 1992, and, in this report, 1994. They also issued, at the request of the National 
Center for Education Statistics, which oversees the Panel’s work, a special report on 
the setting of achievement levels in connection with NAEP. 

To carry out the Panel’s challenging assignment, the Academy engaged two 
outstanding education researchers, Robert Glaser and Robert Linn, as co-chairs, and 
George Bohrnstedt and his colleagues at the American Institutes for Research as
subcontractors to assist in the conduct of the research and writing, as well as a panel 
of distinguished educators and researchers with diverse forms of expertise on the 
NAEP program. They have constructed and overseen an ongoing set of research and 
policy papers examining major issues concerning the state trials of NAEP, including 
validity, sampling, content, data analysis, and reporting issues. In the coming year, the 
Panel will conclude its work with a capstone report that will reflect on broad, long- 
term issues regarding the future of state NAEP and its relationship to national NAEP. 



Carl F. Kaestle 

President, The National Academy of Education 






Table of Contents 



Transmittal Letter
Acknowledgments
Foreword
The Panel
Executive Summary
1 Introduction
2 The Content Validity of the 1994 Reading Assessment
3 Sampling and Assessment Administration for the 1994 TSA
4 The Assessment of Students with Disabilities or Limited English Proficiency
5 Scaling and Analysis of the 1994 Reading Assessment
6 Reading Achievement Levels
7 Reporting and Dissemination for the 1994 Reading Assessment
8 Conclusions and Recommendations
Appendices
Works Cited
List of Abbreviations



Detailed Table of Contents



Transmittal Letter
Acknowledgments
Foreword
The Panel
Executive Summary

1 Introduction
The Context for the Panel's Evaluation of the 1994 TSA
History of NAEP TSA Evaluations
Guiding Principles
NAEP's Mission as an Independent Indicator of Student Achievement
NAEP's Fundamental Criteria of Excellence: Quality and Utility
Principles for Enabling NAEP Excellence
The Panel's Forthcoming Capstone Report on the Future of NAEP
The Structure of this Report

2 The Content Validity of the 1994 Reading Assessment
Introduction
The Overall Structure of the Reading Assessment
Organizing Dimensions
Studies Conducted by the Panel
The Panel's Findings
Omission of Items Measuring Reading to Perform a Task at Grade Four
Problems with the Item Scoring Guides
Problems with the Distribution of Item Difficulties
Too Few Items that Capture the Essential Features of Advanced Reading Achievement
Lack of Clarity in the Stance Dimension
Unaddressed Components of the Reading Process
Summary and Recommendations
Specific Recommendations and Suggestions for Improving the Reading Assessment
Implementing the Panel's Recommendations: Toward a More Effective and Efficient Assessment Development Process

3 Sampling and Assessment Administration for the 1994 TSA
Introduction
Public School Samples
Sampling and Recruitment of Schools
School Participation Rates
Impact of School Nonparticipation
Sampling and Participation of Students
Impact of Under Sampling and Nonparticipation on Student Samples
Weighting
Nonpublic School Samples
The Panel’s 1992 Recommendation to Include Nonpublic School Students
Sampling of Schools
School Participation Rates
The Administration of the 1994 TSA
Comparisons of TSA and National Performance
Comparison of Monitored and Unmonitored TSA Sessions
Relationship Between Student Performance and Characteristics of the Administration
Summary

4 The Assessment of Students with Disabilities or Limited English Proficiency
Introduction
Background for Panel Studies
Exclusion Procedures in Effect through 1994
Exclusion Rates
Characteristics of IEP Students Sampled for the 1994 TSA
The Panel’s Study of Assessability and Exclusions among IEP Students
Assessability
Accommodations
Exclusion Process
Comparability Between States
Characteristics of LEP Students Sampled for the TSA
The Panel’s Study of Assessability and Exclusions among LEP Students
Assessability
Accommodations and Adaptations
Exclusion Process
Changes in Exclusion Policies for the 1996 Assessment
Summary

5 Scaling and Analysis of the 1994 Reading Assessment
Introduction
Overview of NAEP Scoring, Scaling, and Analysis Procedures
Scoring
Scaling
Analysis
Recent Innovations in NAEP Methodologies
Investigation of Decline in Reading Achievement at Grade 12
Discovery of Technical Errors
Summary

6 Reading Achievement Levels
Introduction
The Setting of Performance Standards on NAEP
Criticisms of the NAEP Achievement Levels
Findings from the Panel’s Evaluation of the 1992 Mathematics and Reading Achievement Levels
Internal Consistency
The External Comparison Studies
NAGB’s Revisitation of the 1992 Achievement Levels
Overview of the Revisitation Study
Sorting Items into Achievement Categories Does Not Confirm Specific Cutscores
Statistical Evidence from the Revisitation Study
Analysis of the Reported Hit Rate
Summary
Continuing Issues regarding the Achievement-Levels Setting Procedure
Summary and Recommendations for the Use of Achievement Levels

7 Reporting and Dissemination for the 1994 Reading Assessment
Introduction
The 1994 Reading Assessment Reports
Accuracy of the Assessment Results
Likelihood that Results Will Be Interpreted Correctly by the Intended Audience
Conveying Statistical Significance
Other Efforts to Improve Interpretability of Results
Timeliness of Reports
Analysis Problems and Competition for Resources
State Review of Results
NCES Adjudication Process
Summary
Dissemination and Accessibility of the Findings
Release and Press Coverage of the First Look Report
Release of the Reading Report Card
Other Reports
Suggestions for Increasing the Accessibility of NAEP Data
More Involvement of the Press
Target Audiences with Additional Focused Research Reports
Provide More Examples of Assessment Tasks and Student Responses
Involve the States More in Reporting and Dissemination
Summary

8 Conclusions and Recommendations
Introduction
The Success of the 1994 TSA
Content Validity
Sampling and Assessment Administration
The Assessment of Students with Disabilities or Limited English Proficiency
Scaling and Analysis
Achievement Levels
Reporting and Dissemination
Utility of the TSA
Utility of NAEP Data to the States
Contributions to the National Debate
Limitations on State NAEP Utility
The Impact of State on National NAEP
The Panel’s Recommendation for the Continuation of State NAEP



Appendix A: Detailed Scoring Guides and Examples of Student Responses for Sample Assessment Items Shown in Figure 2.1
Appendix B: Reading Experts Participating in the Panel’s Content Validity Study for the 1994 TSA
Appendix C: Synopses of Studies for The National Academy of Education Panel on the Evaluation of the National Assessment of Educational Progress Trial State Assessment
Content Validation of the 1994 National Assessment of Educational Progress in Reading: Assessing the Relationship Between the 1994 Assessment and the Reading Framework
School and Student Sampling in the 1994 Trial State Assessment: An Evaluation
A Study of the Administration of the 1994 Trial State Assessment
Public School Nonparticipation Study
Study of Exclusion and Assessability of Students with Disabilities in the 1994 Trial State Assessment of the National Assessment of Educational Progress
Study of Exclusion and Assessability of Limited English Proficiency Students in the 1994 Trial State Assessment of the National Assessment of Educational Progress
The 1994 Reading Anomaly: Report to The National Academy of Education on the Drop in the National Assessment of Educational Progress Main Assessment (Short-Term Trend) Scores
Reporting the 1994 Reading Results by Achievement Levels
Impact of the 1992 Trial State Assessment
Perspectives on the Impact of the Trial State Assessments: State Assessment Directors, State Mathematics Specialists, and State Reading Specialists

Works Cited
List of Abbreviations



Jeanne Griffith 

Acting Commissioner 

National Center for Education Statistics 

U.S. Department of Education 

555 New Jersey Avenue 20208-5653 



Dear Jeanne: 

On behalf of The National Academy of Education, I am pleased to transmit to you the 
fourth report of the Academy’s Panel on the Evaluation of the NAEP Trial State 
Assessment, entitled Quality and Utility: The 1994 Trial State Assessment in Reading. 
In it, the Panel evaluates the conduct, validity, and uses of that assessment. They 
apply their guiding principles and their research findings to many policy issues 
concerning the future of the NAEP state assessments. 

This report has been reviewed and approved by The National Academy of Education’s 
Executive Council, acting as a Committee of Readers. We are confident that, like the
preceding reports of this Panel, it will be helpful to policymakers in reaching
thoughtful decisions about important policy issues concerning the National
Assessment of Educational Progress.

This report addresses such pressing problems as how participation in NAEP can be 
maintained and appropriate samples be achieved; how errors can be minimized in the 
complex process of scaling and analyzing the data; how the definition of achievement 
levels can be accomplished; how children with limited English proficiency
or disabilities can be included and reported; how private schools can be included and
reported; and how the NAEP state assessments relate to the national NAEP. 

In addition to these and other consequential issues surrounding the ongoing 
administration and interpretation of state NAEP, there are a number of longer-term 
issues associated with the future of state and national NAEP and the assessment’s 
relationship to developments in American education. To address these issues, the 
Panel plans shortly to issue a capstone report. Issues surrounding the redesign of 
NAEP stem from fundamental changes occurring in the field of educational assessment 
and from the ongoing education reform movement, aimed at transforming the content 
and structure of American education. To determine NAEP’s most constructive role in 
the midst of these changes is both very difficult and very important. The Academy is 
proud of the contributions the Panel has made to that effort over the past six years, 
and we look forward to its capstone report as the culminating advice of this 
distinguished group of educators and scholars. 



Sincerely, 




Carl F. Kaestle 

President, The National Academy of Education 
Professor of Education, The University of Chicago 



Acknowledgments 


This mandated report presents the findings of The National Academy of Education 
concerning its independent evaluation of the 1994 NAEP Trial State Assessment (TSA). 

The Panel once again acknowledges the indispensable support provided by staff at 
the National Center for Education Statistics. Information, practical assistance, and 
reasoned advice have all been offered freely throughout the course of the evaluation, 
greatly facilitating the work of the Panel. In particular, retired Commissioner Emerson
Elliott and his immediate successor, Acting Commissioner Jeanne Griffith, both 
provided thoughtful guidance at many project junctures, while Garry Phillips, Sharif 
Shakrani, and Larry Ogle all have played very active roles in securing required 
technical materials and keeping the Panel apprised of NAEP plans and activities. Ed 
Mooney has continued to assist the staff with the administrative details of the project. 

The courteous and professional collaboration of the NAEP contractors has also 
enriched the Panel’s evaluation and smoothed its progress. Among the many 
Educational Testing Service (ETS) staff who contributed to the Panel’s work on the 
1994 TSA were Eugene Johnson, who provided extensive information and advice 
concerning NAEP's psychometric procedures; Jay Campbell, who answered many 
questions about the reading assessment and also coordinated the special scoring of 
student responses from the Panel’s studies of assessability and exclusions among
students with disabilities or limited English proficiency; Dave Freund and Patricia 
O’Reilly, who oversaw the preparation of data sets for the Panel’s analyses; and 
Debbie Kline, who coordinated responses to the Panel’s questions about the 1994 
Technical Report. John Mazzeo of ETS also played a key role by answering questions 
on a wide range of topics and coordinating response to the Panel’s diverse requests. 

Staff at Westat and National Computer Systems (NCS) have also given freely of their 
advice and assistance concerning NAEP’s sampling, administration, and data processing 
procedures. Special thanks are due to Nancy Caldwell and Diane Walsh of Westat for 
facilitating the recruitment of schools for the Panel’s studies of assessability and
exclusions, observations of TSA assessment sessions for the Panel’s study of the
assessment administration, and our attendance at Westat training sessions, as well as for
generally helping to coordinate the Panel’s field activities with the NAEP schedule. Keith
Rust of Westat served as an essential resource concerning NAEP’s sampling design and
implementation, and Brad Thayer of NCS expedited the flow of administration
documents needed for the Panel’s studies of assessability and exclusions.

Thanks are also due to the National Assessment Governing Board (NAGB) staff, including 
Roy Truby, Mary Lyn Bourque, and Ray Fields, for their cooperation and prompt 
response to requests for information. Sue Loomis and other American College Testing 
(ACT) staff graciously welcomed Panel observers at the 1994 standard setting sessions, 
the revisit of the reading achievement levels, and meetings of the Technical Advisory 
Committee for Standard Setting, as well as promptly fulfilling our requests for 
information. Members of the Assessment Subcommittee, Education Information Advisory 
Committee, and Council of Chief State School Officers also welcomed our 
representatives, while Cadell Hemphill coordinated agendas and kept Panel staff 
informed of Subcommittee activities. Mary Baronne of ETS played a similar liaison role 
with respect to meetings and activities of the NAEP Design and Analysis Committee. 






The Panel expresses its appreciation to the state assessment directors and curriculum 
specialists who responded to its various surveys on NAEP administration and on the 
utility and impact of NAEP reports, to state department of education employees who 
provided us with state assessment data for the Panel’s study of public school 
nonparticipation, and to the many students and teachers who cooperated with our 
studies of assessability and exclusions.

The Panel extends thanks to the staff at American Institutes for Research for 
their role in drafting and redrafting the various chapters that make up this
report. Special thanks go to Fran Stancavage, who took the lead in drafting
most of the chapters. Others contributing heavily to the writing and editing 
were George Bohrnstedt, Jennifer O’Day, Evelyn Hawkins, John Olson, and 
Lorna Bennie. 

Finally, the Panel extends its gratitude to the distinguished researchers who carried out 
the commissioned investigations. Thanks are due to principal investigators P. David
Pearson (Michigan State University), Lizanne DeStefano (University of Illinois, Urbana-
Champaign), Bruce Spencer (Northwestern University), Larry Hedges (University of
Chicago), and Richard Venezky (University of Delaware), as well as to James
Ysseldyke, Kevin McGrew, and Martha Thurlow (National Center on Educational
Outcomes), who provided consultation for the Panel’s study of assessability and
exclusions. Summaries of the commissioned reports are included in appendix C, this
volume, while the full text of the reports is contained in volume two.







Foreword 






Since 1990, every cycle of the National Assessment of Educational Progress (NAEP) 
has included an option for states to participate on a voluntary basis and receive 
state-level results in at least one subject area at one grade level. State NAEP 
assessments were first authorized by Congress in 1988, at which time Congress 
mandated that an evaluation of the feasibility and technical adequacy of such 
assessments be conducted for trials in 1990 and 1992. Pursuant to this legislation,
Trial State Assessments (TSAs) were conducted in eighth-grade mathematics in 1990 
and in fourth- and eighth-grade mathematics and fourth-grade reading in 1992, and 
Congress subsequently extended the trials and the evaluation to include the 1994 
assessment as well. This report has been prepared in response to that mandate and 
provides an evaluation of the 1994 Trial State Assessment (TSA) in fourth-grade 
reading by The National Academy of Education’s Panel on the Evaluation of the Trial 
State Assessment. 

The Panel’s work on the 1994 TSA fourth-grade reading assessment has taken place in 
a period during which numerous innovations in assessment have been implemented 
at the national, state, and local levels. In this context, NAEP serves as a valuable 
independent monitor of status and trends for student achievement as our nation 
proceeds toward improved education for all children and youth. However, NAEP too 
has changed and, in order to be effective, must continue to change and adapt to the 
many requirements posed by new content, new techniques for measuring 
performance, and more inclusive coverage of the nation’s diversity. The Panel 
believes that systematic study of such innovations and their results should continue to 
be an essential part of efforts to enhance our nation’s key independent indicator of 
educational progress. 

This is the fourth of the Panel’s reports. Encompassing numerous studies and 
analytical papers commissioned by the Panel, these reports have served to inform 
technical and policy issues under consideration by Congress, the National Center for 
Education Statistics (NCES), the National Assessment Governing Board (NAGB), and 
the NAEP contractors. 

The first report, Assessing Student Achievement in the States, was issued in March,
1992. It presented the Panel’s findings and observations on the first TSA, which was 
conducted in eighth-grade mathematics in 1990. More specifically, the report 
presented the Panel’s observations on the assessment’s content validity, sampling, 
administration, scoring and interpretation, as well as on the reporting of results to the 
public and press. While the Panel concluded that the trial was largely a success, a set 
of recommendations for changes was also included in the report. 

The second report, Setting Performance Standards for Student Achievement, issued in 
September, 1993, studied the new set of performance standards, called achievement 
levels, that were being implemented for reporting and interpreting NAEP results. The 
Panel’s report examined the process used for setting achievement levels, the validity 
and reasonableness of the 1992 achievement levels in reading and mathematics, and 
the relationship of NAEP to emerging national education standards. The Panel’s 
report, we believe, has made a valuable contribution to the continuing discussion and 
debate about how best to set performance standards on an assessment such as NAEP. 






The Trial State Assessment: Prospects and Realities, the Panel’s third report, was issued
in December, 1993 and examined the 1992 state assessments in reading and 
mathematics as well as critical questions surrounding the continuance of a state NAEP. 
In addition to issues of sampling and administration, content validity, and reporting, 
this report presented a set of guiding principles that could inform not only the
recommendations made in the report, but also discussions and decisions concerning 
the TSA made by Congress, NCES, and NAGB. 

The Panel’s forthcoming capstone report will be released in fall, 1996 and will address 
the role of NAEP in education reform and choices that confront NAEP now and as we 
approach the 21st century. Among the latter are choices about how NAEP can best 
incorporate modern understandings of the acquisition and organization of knowledge,
exploit new technologies, accommodate individuals with special needs, and link 
with other assessment and other educationally relevant data sets to provide richer 
information on the progress of American education. 



Robert Glaser, Chair 
Robert Linn, Co-Chair 



George Bohrnstedt, Project Director
April 1996 






The National Academy of Education Panel 
on the Evaluation of the NAEP Trial State 
Assessment Project 



Robert Glaser, Chairman 

Director, Learning Research and Development Center (LRDC) 
and National Research Center on Student Learning 
University of Pittsburgh 

Robert Linn, Chairman 

Co-director, National Center for Research 
on Evaluation, Standards, and Student Testing 
University of Colorado at Boulder 



Anthony Alvarado 

Community Superintendent 
Community School District 2 
NYC Board of Education

Gordon M. Ambach 

Executive Director, 

Council of Chief State School Officers 

Lloyd Bond 

Professor, Educational Research 
Methodology 

University of North Carolina at 
Greensboro 

Ann Brown 

Professor of Education in Math, Science, 
and Technology 

Evelyn Lois Corey Fellow in Instructional 
Science 

University of California at Berkeley 

Alonzo Crim 

Professor of Education 
Spelman College 

Pasquale J. DeVito 

Rhode Island Department of Education 

Edmund W. Gordon 

The John M. Musser Professor of
Psychology, Emeritus, Yale University 
and Distinguished Professor of 
Educational Psychology, 

City University of New York 

Robert Groves 

Professor of Sociology and Associate 
Director 

Joint Program in Survey Methodology 
University of Michigan 



Richard Jaeger 

Excellence Foundation Professor of 
Education and Director, 

Center for Educational Research and 
Evaluation 

University of North Carolina at 
Greensboro 

Lyle Jones 

Research Professor 

L.L. Thurstone Psychometric Laboratory 
Department of Psychology 
The University of North Carolina at 
Chapel Hill 

Mary M. Lindquist 

Fuller E. Callaway Professor of 
Mathematics Education 
Columbus State University 

P. David Pearson 

Michigan State University 

Edward Roeber 

Director of Student Assessment Programs 
Council of Chief State School Officers 

Albert Shanker 

President 

American Federation of Teachers 

Lorrie M. Shepard 

Professor of Education 
University of Colorado at Boulder 

Lauress Wise 

President, Human Resources Research 
Organization (HumRRO) 



Project Staff

American Institutes for Research

George Bohrnstedt, Project Director
Fran Stancavage, Associate Project Director
Jill Allen
Lorna Bennie
Michelle M. Bullwinkle
Phyllis DuBois
Dey Ehrlich
Catherine Godlewski
Elizabeth Hartka
Evelyn Hawkins
David Huang
Elise McCandless
Don McLaughlin
Jennifer O’Day
John Olson
Marianne Perie
Inna Shapotina
Audrey Struve
Jin-Ying Yu

Learning Research and Development Center

Elizabeth Rangel
Cindy Yockel







Executive Summary 



Quality and Utility: The 1994 Trial State Assessment in Reading

The National Assessment of Educational Progress (NAEP) has been the nation’s
leading indicator of academic achievement for more than 25 years, providing fair and 
accurate information about the performance of U.S. students in core subject areas. 
With findings based on representative samples of students in grades 4, 8, and 12, 
NAEP has long been recognized as an unparalleled resource for educators, policy 
makers, and all others concerned with national trends in educational progress. 

Analyses of NAEP trend data in the past decades revealed a significant narrowing of 
the achievement gap between African American and white students during the late 
1970s and 1980s, while subsequent changes in these same data patterns alerted 
educators and policy makers to an apparent reopening of that gap in the 1990s. 
Similarly, NAEP has pointed to important trends in specific subject areas, 
documenting, for example, the declining achievement of U.S. students in science 
between 1970 and 1990, as well as the limited time spent on science instruction in 
most elementary classrooms. These latter data have provided evidential support for 
groups such as the American Association for the Advancement of Science and the 
National Science Foundation, who have argued for greater attention to science 
education in American schools. Finally, NAEP data were also cited in discussions and 
debates that, in 1989, led the governors and then, in 1994, Congress to target 
improved science achievement as an important national education goal. 

In 1988, NAEP’s role was substantially expanded. Responding to increased education 
reform activity and heightened interest in monitoring progress within the states, 
Congress lifted the prohibition against collecting and reporting NAEP data at the state 
level. Public Law 100-297 authorized voluntary state NAEP assessments on a trial basis 
for 1990 and 1992; this authorization was subsequently extended to include a third 
trial state assessment (TSA) in 1994. In recognition of the significance of the state 
NAEP experiment, Congress also called for an independent evaluation of the TSA to 
judge its feasibility, quality, and utility. Under a grant from the National Center for 
Education Statistics (NCES), The National Academy of Education (NAE) established this 
Panel to undertake the evaluation. 

Three previous Panel reports, spanning the first two TSAs, have been submitted to 
Congress. In them, the Panel concluded that the TSAs were successful and should be 
continued. Areas for further study were also identified, including areas in which the 
full consequences would not be evident before the trials were scheduled to end. 
Accordingly, the Panel, in its most recent report (on the 1992 TSA), called for a
continuing evaluation and — having noted that “many of the factors affecting the 
quality and feasibility of state NAEP are the same as those affecting national NAEP” 1 — 
proposed that the evaluation be expanded to include the full NAEP program. It also 
recommended continuing research and development in the important area of 
performance standards. 



1 The National Academy of Education, The Trial State Assessment: Prospects and Realities (Stanford, CA: 
Author, 1993), 104. 







With the Improving America’s Schools Act of 1994, Congress adopted many of the 
Panel’s recommendations. Importantly, the legislation authorized NAEP state 
assessments, mandated the continuing independent review of the entire NAEP 
program, and directed that the state assessments and achievement levels be used on a 
developmental basis until the Commissioner of Education Statistics made a final 
determination of their validity and utility. 

In this, its fourth report, the Panel presents recommendations and findings specific 
to its evaluation of the 1994 TSA in fourth-grade reading and offers several general 
conclusions regarding the state NAEP assessments. This fall, the Panel will 
conclude its work by releasing a capstone report that builds on these conclusions, 
and on its previous reports, to address issues in the redesign of NAEP for the year 
2000 and beyond. 



Dimensions of the Evaluation 



In preparing this report, the Panel has drawn upon its extensive experience with the 
previous TSAs as well as studies and papers commissioned specifically for the 1994 
assessment. In particular, the Panel found that the guiding principles articulated in its 
third report to Congress remain highly relevant and continue to shape the perspective 
for its evaluation. These principles, revised and regrouped for clarity, are presented 
on pages 3 through 5 of this report. 

The Panel gathered evidence regarding several dimensions of the 1994 assessment and 
weighed them against its principles. Some of these dimensions — content validity, 
sampling, assessment administration, and reporting and dissemination — have been 
central to the Panel’s considerations of each of the previous TSAs. For 1994, the Panel 
also updated its conclusions regarding the National Assessment Governing Board 
(NAGB) achievement levels and added new emphases on scaling and analysis and on 
the assessment of students with disabilities or limited English proficiency. 

These various dimensions are discussed in chapters 2 through 7 of the report; the 
Panel’s primary conclusions and recommendations on each are presented below. 



Content Validity 

The 1994 NAEP reading assessment marked the second use of the reading framework 
developed in 1991. A portion of the item pool was released to the public and 
replaced between the 1992 and 1994 assessments, but the overall parameters of the 
assessment were held constant, allowing reading trends to be measured for the first 
time on tasks that reflect current understandings of reading and reading assessment. 
The Panel reviewed the framework and items for content validity after each of the two 
assessments and, in each instance, concluded that the NAEP reading assessment was a 
reasonable representation of current theories in reading, a reasonably valid measure of 
reading achievement in the nation, and relevant to everyday classroom practice. The 
Panel commends NAGB and NCES for building a challenging assessment of reading 
achievement that extends beyond simple mastery of the mechanics of reading to 
include the reader’s ability to draw meaning from text and to communicate this 
understanding to others. 






Furthermore, the Panel concluded that the decision to hold frameworks in reading and 
other content areas constant over several assessment cycles was praiseworthy — a 
judgment that was confirmed by the strong interest of NAEP constituents in using 1994 
results to gauge the progress of their students over time. Based on these findings, the 
Panel recommends that the general structure (framework) of the present reading 
assessment be maintained through the year 2000 or 2002. 

During the evaluation, the Panel’s reading experts also noted aspects of the 
framework and item pool that could be improved, although none of these 
shortcomings were sufficient to undermine the content validity of the 1994 assessment 
in a substantial way. More specifically, the 1994 fourth-grade assessment contained 
relatively few items that were within the scope of the least able students, making it 
difficult to get precise and reliable estimates of achievement for those at the lower end 
of the NAEP scale. Some unevenness of item quality was also observed. Specifically, 
some of the scoring guides for constructed-response items were inconsistent with 
other features of the items or with the directions given to students, and a number of 
the more difficult items failed to capture the essential features of advanced reading 
achievement. The Panel judges each of the above to be areas in which the NAEP 
contractor should begin improvements immediately, in preparation for the next NAEP 
reading assessment. 

Finally, the Panel points out, as it has in its previous reports, that under the current 
funding and development process there is little time for planned, farsighted content 
development. In particular, during years when new frameworks are adopted, the 
NAEP contractor has typically had less than six months in which to develop field test 
materials before these materials must be finalized. Nevertheless, assessment tasks that 
are produced during this one brief period set the tone for all future assessments until 
the next revision of the framework, eight or ten years in the future. 

The Panel therefore recommends that, for every NAEP subject area, NAGB and NCES 
adopt a process that allows new research and development to begin several years 
before the framework is scheduled to be revised and a new trend line begun. This 
research and development could progress in a relatively modest manner through 
successive pilot studies and small-scale trials targeted at particularly challenging 
research problems. When it is time to begin the actual revision, Congress, NAGB and 
NCES should allow for a framework and item development cycle that is substantially 
longer and more integrated than the current one. The Panel further recommends that 
Congress forward fund NAEP in order to facilitate this process. 



Sampling and Assessment Administration 



As it had in 1990 and 1992, the Panel once again concluded that both sampling and 
administration for the 1994 TSA were done well, were generally consistent with best 
practice for major surveys of this kind, and, with the exceptions noted below, produced 
valid and useful state results. Two areas of concern were identified, however. 

First, and most important, substantial problems were found with the samples of 
nonpublic schools that were — at the Panel’s previous recommendation — added to the 
TSA for the first time in 1994. The Panel’s motivation to include nonpublic schools
was based on its inclusiveness principle, and the Panel’s intention was to aggregate
the nonpublic school results with those from public schools in order to generate better 
overall state composites. However, NCES adopted more extensive reporting plans 
after determining that it would be difficult to recruit nonpublic schools without 
offering them separate reports of student achievement by type of school control. 

Upon reviewing the evidence, the Panel concluded that the 1994 state samples of 
nonpublic schools were not large enough to support separate reporting. In addition, 
participation rates for originally-sampled nonpublic schools were unacceptably low in 
approximately 40 percent of the states, and final samples were biased, in many cases, 
by the fact that certain kinds of nonpublic schools were much less likely to participate 
than others. 

The Panel recommends that NAGB and NCES stop separate reporting of state-level 
nonpublic school results but, where participation rates are sufficiently high, continue 
reporting state-level results for public and nonpublic schools combined and for public 
schools only. Furthermore, the reports should include prominent warnings about the 
invalidity of simplistic comparisons between public and nonpublic schools in order to 
discourage efforts to derive such comparisons by subtracting public school means 
from the combined public and nonpublic school results. These warnings should be 
illustrated by concrete examples to underscore their significance. 
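
The kind of concrete example called for here can be sketched numerically. The following is an illustrative sketch only: the function name, the 90/10 enrollment split, and all score values are hypothetical and are not drawn from NAEP data. It assumes the combined mean is a simple enrollment-weighted average of the public and nonpublic means, and shows that backing a nonpublic mean out of the combined and public-only results magnifies any error in the combined estimate by the inverse of the nonpublic share.

```python
def implied_nonpublic_mean(combined_mean, public_mean, public_share):
    """Back out the nonpublic mean implied by a combined mean.

    Assumes combined = public_share * public + (1 - public_share) * nonpublic.
    All inputs used below are hypothetical.
    """
    nonpublic_share = 1.0 - public_share
    return (combined_mean - public_share * public_mean) / nonpublic_share


# Hypothetical state: 90 percent of students attend public schools.
baseline = implied_nonpublic_mean(212.0, 210.0, 0.90)

# Move the combined mean by a single (hypothetical) scale point of error.
shifted = implied_nonpublic_mean(213.0, 210.0, 0.90)

# The one-point shift is multiplied by 1 / 0.10 in the derived figure.
amplification = shifted - baseline
print(round(baseline, 1), round(shifted, 1), round(amplification, 1))
```

Under these assumed values, a one-point error in the combined mean moves the derived nonpublic figure by about ten points, which is why such subtractions are unreliable.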

At the same time, NAGB and NCES should explore alternative strategies (other than 
separate state-level reporting) for motivating the participation of nonpublic schools. 
One proposed course of action would be to offer more detailed reporting of private 
school results at the national level by basing the analyses on aggregated data from the 
national and state samples of nonpublic schools. (For example, NAEP could break out 
results by more detailed categories of private schools.) 

A second area of concern to the Panel involves the participation of public schools. 
Although the Panel found that in 1994, for nearly all states, the participation rates for 
originally-sampled public schools ranged from acceptable to good, strong indications 
have emerged that the data collection burden on the states, especially small states, 
may begin to threaten school and hence state participation rates in years when 
multiple subjects and grades are assessed. 

The Panel recommends that NCES and NAGB consider design changes that could 
decrease sample size requirements or otherwise reduce burden without compromising 
the overall quality of the assessment. Applicable design changes could include 
relatively circumscribed modifications, such as applying the principles of finite 
sampling to create a different set of rules for the smallest states. Reduced respondent 
burden could also be effected as one outcome of a more radical redesign of NAEP, 
and various versions of the latter are currently being debated by NAGB and other 
interested parties. 
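
One reading of "the principles of finite sampling" is the standard finite population correction from survey sampling. The sketch below assumes simple random sampling without replacement, which is a simplification (the actual NAEP design is a stratified, multistage one), and the target sample size and population counts are hypothetical. It shows how a sample size derived for a very large population can be reduced for a small one without loss of precision.

```python
import math


def fpc_sample_size(n_infinite, population_size):
    """Sample size that matches the precision of n_infinite once the
    finite population correction is applied (simple random sampling
    without replacement): n = n0 / (1 + (n0 - 1) / N)."""
    return math.ceil(n_infinite / (1.0 + (n_infinite - 1.0) / population_size))


# Hypothetical target of 2,500 students derived for a very large population.
large_state = fpc_sample_size(2500, 400_000)  # hypothetical large grade cohort
small_state = fpc_sample_size(2500, 8_000)    # hypothetical small grade cohort
print(large_state, small_state)
```

With these assumed figures the small state needs noticeably fewer students than the large one for the same nominal precision, which is the sense in which a separate rule for the smallest states could reduce burden.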



The Assessment of Students with Disabilities or 
Limited English Proficiency 



The 1994 assessment cycle occurred at a time when NCES and NAGB were beginning 
to re-examine NAEP policies regarding the exclusion and assessment of students with 
disabilities or limited English proficiency (IEP and LEP students). In 1994, NCES
gathered data from several sources and met with representatives of the disability and
bilingual communities to discuss the best methods for increasing inclusion in NAEP. 
The Panel, for its part, collected new data for samples of fourth-grade IEP and LEP 
students who had been selected for participation in the TSA, then shared its 
preliminary findings with NCES. The results of these various efforts led to a set of 
revised exclusion procedures and new allowances for accommodated assessment that 
were tried out in the 1995 field test and implemented, in a controlled design, in 1996. 

The Panel’s 1994 study indicated that school personnel in different states tended to 
interpret the (old) exclusion guidelines differently. Thus, on average, IEP students with 
the same level of ability would be included in some states and excluded in others. 
The Panel also found that a high proportion of IEP students (perhaps as many as 85 
percent) could read well enough to participate in NAEP and be included in estimates 
of overall state achievement. However, the current NAEP reading test is not 
particularly well suited to the reading abilities of the many IEP students who are 
reading a grade or more below grade level. A more appropriate measure for these 
students would address the same reading outcomes but be based on less difficult 
reading passages. 

The Panel’s study of LEP students indicated that a significant proportion of LEP 
students also read well enough in English to participate for the purpose of 
contributing to overall state NAEP results. Disturbingly, among the LEP students 
sampled for the Panel’s study (which was limited to LEP students who had attended 
English-speaking schools for at least two years), more than half had been excluded 
from the TSA. This was true even though more than three-quarters of the Panel’s 
sample had been in English-speaking schools for more than four years — essentially 
their entire school careers. The Panel acknowledges that NAEP may not offer these 
students an optimal opportunity to demonstrate their competence, particularly in 
content areas such as mathematics or science. In reading, however, the Panel believes 
that it is reasonable to ask how well these students are able to read in English. 
Moreover, the education fortunes of LEP students may too easily drop from sight if 
they are excluded from major assessment efforts. 

Finally, the Panel’s studies found that teachers of both IEP and LEP students were 
likely to propose testing accommodations for high percentages of their students. Thus, 
when accommodations are offered, inclusion may be increased, but the overall 
numbers of students assessed under standard conditions may actually go down. This is
problematic because scores obtained under nonstandard conditions are much more 
difficult to interpret. 

Based on its research findings, the Panel makes the following recommendations. 

1. NCES and NAGB should continue efforts to encourage greater participation of 
students with disabilities or limited English proficiency in the current NAEP 
assessments. At the same time, they should continue research to identify adaptations 
or accommodations for each of these groups that would provide more valid measures 
of subject-area achievement as specified in the NAEP content frameworks. 

2. Results for students with disabilities or limited English proficiency assessed under 
standard conditions should be aggregated with results for all other students in 
producing the overall and subgroup achievement estimates normally reported for the 
nation and the states. The results for these populations should not be disaggregated or 
reported separately. 








3. NAEP should also work to develop assessments that can measure accurately over a 
broader range of student proficiency levels and thereby provide better estimates at 
both ends of the achievement distribution. For efficiency, such an assessment would 
almost certainly require some adaptive mechanism (computerized or otherwise) for 
matching students with assessment tasks appropriate to their levels of proficiency. 



Scaling and Analysis 



The procedures used for scaling and analysis in the TSA are generally the same as 
those used in the national NAEP, and analyses for the two assessments are largely 
interconnected. In 1994, two technical errors affecting state scores were discovered, 
and an unexplained but statistically significant drop in performance was observed in 
the national reading results at grade 12. These occurrences led the Panel to give 
greater attention to scaling and analysis in its evaluation of the 1994 TSA than it had in 
its previous evaluations. 

In general, the Panel concluded that NCES and its contractors continue to make use of 
sophisticated methods to solve challenging measurement problems posed by recent 
innovations in testing and to produce generally high quality data. At the same time,
however, the system appears to be showing strains that allow errors to creep in, in
addition to lengthening the time to reporting. Factors contributing to these strains have 
included pressure for frequent enhancements to the assessments, increased analysis 
volume, and policy pressure to reduce time to reporting. 

Recent efforts to report short-term trends based on the main NAEP assessments have 
also uncovered some potential problems, related to the fact that many small 
modifications in items and procedures have been permitted between assessments. In 
particular, the accuracy of the 1994-to-1992 12th-grade equating may have been 
affected because the proportion of multiple-choice items was substantially higher in 
the common item pool that served as the basis for the equating than it was in the 
overall assessment. If the two item types indeed measure somewhat different skills, 
then the link was not a good proxy for the whole. 
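
The concern can be illustrated with a deliberately simplified calculation. The sketch below is not the NAEP equating procedure, which relies on item response theory scaling; it treats the measured change as a share-weighted average of the change on each item type, and the item shares and score changes are hypothetical. It shows only that a link pool weighted toward multiple-choice items can report a different change than the full assessment would.

```python
def composite_change(mc_share, mc_change, cr_change):
    """Score change for an item pool that mixes multiple-choice (mc) and
    constructed-response (cr) items, weighted by each type's share."""
    return mc_share * mc_change + (1.0 - mc_share) * cr_change


# Hypothetical changes between two assessments, by item type.
mc_change, cr_change = -1.0, -5.0

# Hypothetical item mixes: 40% multiple-choice overall, 70% in the link pool.
full_assessment = composite_change(0.40, mc_change, cr_change)
common_items = composite_change(0.70, mc_change, cr_change)
print(round(full_assessment, 2), round(common_items, 2))
```

Under these assumed values the common-item pool understates the decline seen on the full mix of items, because it underweights the item type on which performance changed most.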

While no one major problem with the 1994 scaling or analysis was observed, the 
accumulation of smaller problems suggests that a modified assessment design would 
better fit the size and objectives of the current NAEP program. The Panel therefore 
supports NAGB’s efforts to develop a new, more streamlined design for NAEP. 

In the meantime, the Panel makes the following recommendations to help ensure the 
integrity of NAEP results: 

1. Any significant change in performance on the short-term trends should routinely be 
checked for reasonableness against other sources of trend data — sources such as the 
long-term NAEP trend data and state assessment trend data— before the results of the 
short-term trend are reported. 

2. NCES should conduct or commission additional studies to validate the current 
analysis and scaling models. These studies should include research on the strength of 
the models being employed and the robustness to violations of assumptions. 






3. Additional procedures designed to verify the integrity of the NAEP data prior to its
release should be investigated, and NCES should continue to give priority to the timely 
release of high quality technical reports that provide thorough documentation of all 
design related, technical, and psychometric activities associated with the assessments. 



Achievement Levels 


The achievement levels established by NAGB for the 1992 reading assessment were 
again used for reporting the 1994 assessment. The Panel realizes that reporting by 
performance standards is greatly valued by much of the NAEP (and TSA) constituency. 
Nevertheless, the Panel continues to question the reliability and validity of the current 
achievement levels. At the time of its evaluation of the 1992 achievement levels, the 
Panel concluded that 1) the standard-setting method had led to serious internal 
inconsistencies that could have especially troubling consequences if the mix of item 
types changed over time and 2) the distributions of student performance established
by the achievement-level cutscores were not reasonable based on comparison to the
distributions suggested by various non-NAEP measures. In particular, the weight of 
the evidence suggested that the 1992 achievement levels were set too high. 

Although the achievement-levels contractor fielded a study in 1994 that putatively 
addressed the second of these concerns, the Panel concluded that the design of the 
study did not permit confirmation of specific cutscores. The study was therefore not 
particularly informative with respect to the Panel’s conclusion that the cutscores had 
probably been set too high. 

The Panel also examined the results for the 1994 U.S. history and world geography 
achievement levels in order to determine whether they would exhibit better internal 
consistency or a better match to external criteria than had the 1992 reading and 
mathematics achievement levels. 2 In fact, the Panel once again found troubling 
differences in achievement-level cutscores set using dichotomous versus partial-credit 
(extended-response) items. Although not as dramatic as the differences found for the 
1992 achievement-levels, the 1994 results again showed that levels set using extended- 
response items were considerably higher than those set using multiple-choice or 
dichotomously-scored constructed-response items. 

The Panel also examined the achievement levels in relation to performance on the AP
examination in U.S. history. 3 Many colleges and universities give college credit for AP
courses taken in high school if students score three or better, and the Panel found that
2.8 percent of the country’s high school seniors met this criterion on the AP U.S.
history examination in 1994. By contrast, NAEP classified only 1 percent of high
school seniors at the advanced level in this subject. Moreover, the percentage passing
the AP criterion would have been even higher if AP programs had been available in
all U.S. high schools instead of only half of them. These findings provide additional
evidence that the Governing Board’s achievement levels are set too high, that is, that
the achievement levels identify fewer 12th graders as advanced than actually are
performing at an advanced level.



2 U.S. history and world geography were assessed nationally in 1994 but were not included in the TSA.
It was not therefore in the Panel’s purview to conduct a formal evaluation of the achievement levels set
for these subjects. However, to the extent that the data were readily available, the Panel believed it
should determine whether or not the results from these new level-setting efforts confirmed the Panel’s
earlier findings.

3 Only U.S. history could be considered because no AP examination is offered in world geography.

Based on its accumulated evidence concerning the achievement levels and the process 
by which they were set, the Panel makes the following recommendations. 

1. NAGB should institute a competition for the design of new methods for setting 
performance standards for all NAEP subjects with the goal of having a new method in 
place by the time of the year 2000 NAEP assessment. 

2. In the interim, current achievement levels should be accompanied by a warning 
stating that results should be interpreted as suggestive rather than definitive because 
they are based on a methodology that earlier evaluation panels have questioned in 
terms of accuracy and validity. 



Reporting and Dissemination 

In considering the quality of NAEP reporting since the inception of the TSA, the Panel 
has identified four criteria fundamental to successful reporting: 

♦ The accuracy of the results; 

♦ The likelihood that the results will be interpreted correctly
by the intended audience; 

♦ The extent to which the results are accessible and adequately 
disseminated; and 

♦ The timeliness with which the results are made available. 



With respect to the first three criteria, NCES, NAGB, and the NAEP contractors have 
made steady progress. For example, innovative graphic formats intended to convey 
the statistical significance or insignificance of differences between states and across 
time have been tried after each TSA, and the map graphics introduced in 1992 proved 
more successful than earlier efforts. The 1994 reports retained most of these graphics 
and also addressed other concerns about the interpretability and accessibility of the 
results by introducing more charts, visually simplifying the data tables, using more 
white space, and generally shortening the reports. NCES has also begun a series of 
focused reports that highlight specific findings from each assessment cycle, and these 
also have been well received. Two such reports have been scheduled for the 1994 
reading assessment. Further, in response to the expressed need of the state assessment 
directors for a brief and readable summary of results that they could distribute to 
educators and policy makers in their states, NCES produced a four-page brochure that 
was released with the main reading reports in March 1996. 






The 1994 TSA was particularly problematic, however, with respect to timeliness. 
Despite efforts by NCES, NAGB, and the NAEP contractors to speed up reporting, the 
new First Look report, which contained only summary findings for the 1994 reading 
assessment, was not released until April, 1995 (13 months after the administration). The 
main reading reports did not appear for nearly another year after that — the longest lag 
between assessment and reporting that has occurred to date. Factors which contributed 
to the delay included unexpected data problems, shifting program priorities, and 
competition for the services of qualified analysis staff. The Panel strongly encourages
NCES and NAGB to continue to press for more timely reporting while also
being careful to maintain the quality and integrity of the data.



Utility of the TSA



The final perspective that bears on the overall evaluation of the TSA, and in effect 
subsumes all other perspectives, concerns its utility. As suggested above, utility must 
rest, firstly, on the validity and reliability of the data. Beyond this, the results must be 
timely, accessible, and policy relevant, and the program must be perceived as useful 
and valuable by the major customers of the information it provides — particularly the 
states. To investigate the latter, the Panel commissioned surveys and case studies of 
NAEP’s perceived influence after the release of each round of TSA data, concluding 
with a set of case studies and a mail survey of state assessment directors, mathematics 
specialists, and reading specialists in December, 1995. Throughout its evaluation, the 
Panel also monitored media coverage of NAEP and the TSAs and followed the 
opinions and actions of other NAEP stakeholders. 



Utility of NAEP Data to the States 

For the most part, the Panel concluded from these efforts that state NAEP has become 
a valued indicator of educational progress and has served particularly to provide an 
independent validity check on the states’ own assessments. In Rhode Island, for 
example, the state reading specialist reported that the 1994 TSA reading results 
provided important evidence for the success of an ongoing reading initiative. The 
external monitor role has been especially important during a period when many state 
assessments have undergone radical reform, making upward or downward trends in 
their results particularly difficult to interpret. 

Several factors contribute to state NAEP’s credibility and hence its value to the states 
as an external monitor. These include the assessment’s forward-looking content and 
format, the secure status of its testing materials, and the rigorous statistical standards 
maintained in data collection, analysis, and reporting. (The long lag time to reporting, 
and the lack of a stable assessment schedule against which states can plan, however, 
are two factors cited repeatedly by the states as diminishing state NAEP’s utility.) 

When state NAEP has yielded dramatic or unexpected results, particularly
when a state’s students performed worse than expected, considerable public debate 
has followed. North Carolina and California both provide notable, and very different, 
examples of this effect. 







In 1990, North Carolina educators were dismayed to discover that the state’s students 
had done much worse on the NAEP mathematics assessment than the educators had 
expected, based on results from the state’s own assessment, a commercially available, 
norm-referenced test. During the subsequent debate and discussion, decision makers
concluded that North Carolina teachers generally lacked certain key understandings 
that were required to implement their recently introduced, forward-looking 
mathematics curriculum successfully. Information from the NAEP background 
questions on instructional practices helped North Carolina reach this conclusion, and 
the state subsequently undertook an intensive in-service training program that was 
based in part on materials and data from the 1990 NAEP. These remediation efforts 
appeared to be successful in that the 1992 TSA showed a significant gain in eighth- 
grade mathematics achievement for the state. 

In California, educators and the public were also shocked when fourth-grade reading 
achievement estimates from the 1994 TSA showed California performing significantly 
worse than it had done in 1992, and positioned virtually at the bottom of the 
distribution of participating states. This information was particularly important in view 
of the fact that California’s own assessment system has been in disarray for the past 
several years, precluding any meaningful assessment of performance trends from that 
particular source. However, in the resultant furor, most commentators simply pointed 
to the TSA results as further evidence for what they already felt was wrong with the 
state’s education system — whether that was crowded classrooms or the state’s whole 
language reading curriculum. 

Besides using NAEP as an external monitor of achievement, about 60 percent of the 
states that undertook revisions to their mathematics or reading curricula during the 
past five years reported NAEP as a notable source of ideas. Similar numbers referred 
to NAEP as a model, or a source of external validation, for changes to their reading or 
mathematics assessments. State educators, for example, have closely followed NAEP’s 
pioneering efforts to set performance standards, and both assessment directors and 
curriculum support staff have used NAEP’s external credibility to argue for such 
desired objectives as better alignment with National Council of Teachers of 
Mathematics (NCTM) standards or with reading standards based on reading for 
meaning, higher order skills, and real-world reading tasks. 



Contributions to the National Debate 


Interestingly, state NAEP has broadened NAEP’s influence not only at the state level, 
as might be expected, but also at the national level. NAEP has been adopted by the 
National Goals Panel as the primary indicator of progress towards goal three, which 
states that “By the year 2000, American students will leave grades four, eight, and 
twelve having demonstrated competency in challenging subject matter...” 4 

NAEP also routinely receives national press coverage after each of its major data
releases. The coverage has tended to be more widespread when regional media are
able to report on results for their own states as well as for the nation. Publications
devoted to education news, such as Education Week, also contain frequent references
to NAEP, both as a unique source of information about education achievement and as
a model for current assessment practices.



4 National Education Goals Panel, The National Education Goals Report: Building a Nation of Learners
(Washington, D.C.: Author, 1991), 10.



The Impact of State on National NAEP 

When state NAEP was authorized by Congress on a trial basis in 1988, one of Congress’ 
central concerns was whether state NAEP would have a deleterious effect on national 
NAEP. By asking this question, Congress was tacitly affirming the importance of 
protecting the integrity of national NAEP and expressing a concern that state NAEP 
might have a negative impact on state participation in national NAEP, especially in the 
case of small states. There is little evidence that this has happened to date. 

Rather, the Panel believes that an implicit, mostly unspoken quid pro quo has 
developed between the states and NAGB, by means of which the states are willing to 
participate in national NAEP at least in part because of the value they get from 
participation in state NAEP. Since 1990, the Panel has observed movement from 
guarded cooperation among participating states to general anticipation when state 
NAEP results are about to be released. Positive attitudes toward state NAEP can only 
grow if NCES and NAGB are successful in addressing the relatively few persistent 
concerns, such as the uncertainty of the assessment schedule, that states have cited 
repeatedly. As a result, the Panel suggests that, in the unlikely event that Congress 
were to recommend the abandonment of state NAEP, the motivation of the states to 
continue in national NAEP could drop precipitously. 

Thus, in contrast to its original conclusion at the end of the evaluation of the 
1990 TSA, which was simply that state NAEP had had no deleterious effect on national 
NAEP, the Panel now believes that the future of national NAEP has become 
intertwined with the future of state NAEP. State NAEP has greatly increased the 
visibility and perceived utility of the entire NAEP program, and suggestions for 
merging the state and national samples continue to arise (although it is not evident 
that such a merger would be feasible or significantly reduce burden). 

There is obviously also an interaction between monies spent on state NAEP and 
monies available to maintain a quality national NAEP program, but the nature of this 
interaction is complex. On the one hand, the substantial funds spent for state NAEP 
cannot then be spent for other NAEP activities. On the other hand, the heightened 
visibility conferred by state NAEP may result in a net increase in national NAEP 
resources. For example, the substantial framework and item development efforts that 
have characterized the last several years have benefited both programs and almost 
certainly would not have been funded without the impetus of state NAEP. 



The Panel’s Recommendation for the 
Continuation of State NAEP 



Based on its evaluation of the TSAs, the Panel concludes that state NAEP has been 
shown to be a valid, reliable, and useful measure of student achievement, and that it 
aligns favorably with the Panel’s quality, utility, and state indicator principles. For 







these reasons, the Panel recommends that state NAEP be continued, and that it be 
moved from developmental to permanent status when NAEP is next reauthorized. 
However, in light of its size and cost, the Panel further recommends that the scope 
and function of state NAEP be reviewed regularly, and particularly after any 
substantial change in mission or design. Such re-evaluation should be done in the 
context of the overall NAEP program and with the abiding aim of providing the best 
and most useful information about student achievement for the nation. 

There are areas, however, in which it is not yet possible to determine the course that 
will best serve NAEP’s mission and goals. Some of these areas, which should continue 
to be examined in the near future, include 

♦ The viability of continuing to assess nonpublic schools in the state 
NAEP program; 

♦ The value and feasibility of grade-12 state assessments; 

♦ The tension between including as many students with disabilities and 
limited English reading skills in the assessment as possible, and the 
cost of doing so; 

♦ The adequacy of the present NAEP design to meet the increasing 
demands of NAEP’s stakeholders while still satisfying the Panel’s 
quality principle; and 

♦ The development of improved performance standards for reporting 
NAEP results. 



In fall 1996, the Panel will present its capstone report. Building upon the Panel’s 
previous work, the report will look forward to the year 2000 and beyond, considering 
recommendations for the design of a NAEP program that offers quality assessments for 
the nation and the states and also anticipates the changing nature of education 
practice as it is influenced by technology and by our developing 
knowledge of learning and human cognition. 






1 Introduction 






The Context for the Panel’s Evaluation of the 1994 TSA 



Since its inception in 1969, the National Assessment of Educational Progress (NAEP) 
has been the nation’s leading indicator of what American students know and can do. 
The high technical quality of the assessment and its independence from education and 
political pressures have enabled NAEP to reliably monitor changes in education 
achievement and practices for nearly three decades. Moreover, as the only education 
assessment administered to a representative sample of American students, NAEP has 
been able to track changes not only for the population as a whole, but for important 
subgroups as well. For example, analyses of NAEP trend data in the past decades 
revealed a significant narrowing of the achievement gap between African American 
and white students during the late 1970s and 1980s, while subsequent changes in 
these same data patterns alerted observers to an apparent reopening of that gap in the 
1990s. Similarly, NAEP has pointed to important trends in specific subject areas, 
documenting, for example, the declining achievement of American students in science 
between 1970 and 1990, as well as the limited time spent on science instruction in 
most elementary classrooms. These data have provided evidential support for groups 
such as the American Association for the Advancement of Science and the National 
Science Foundation, who have argued for greater attention to science education in 
American schools. Finally, NAEP data were also cited in discussions and debates that 
led the Governors and then Congress to target improved science achievement as an 
important national education goal. 

The National Assessment of Educational Progress has thus been a long-standing and 
useful source of information for national education reformers and policy makers. It 
was not until 1988, however, that NAEP could begin to play a similar role for the 
individual states. Prior to the passage of Public Law 100-297 in that year, NAEP was 
prevented both by its design and by its mandate from collecting and reporting data at 
the state level. That prohibition was lifted by Congress in 1988 with its authorization 
of an experimental new component to the NAEP program — the Trial State Assessments 
(TSA). This action reflected both the educational and the political context of the 
times. Specifically, by approving and funding the state-by-state assessment, Congress 
was responding to increased education reform activity in the states and to expanded 
interest in using state-level data to monitor progress. At the same time, by making 
participation in the TSA voluntary, Congress demonstrated its continuing respect for 
the constitutional authority granted to the states for the education of their residents. 
Finally, by authorizing state NAEP on a “trial” basis only, Congress showed recognition 
that such a massive expansion of the NAEP program was unproven in its direct 
effectiveness and in its overall impact. Congress further underscored this recognition 
by mandating that the trials be independently evaluated so as to determine whether 
they provided valid, reliable, and useful data for the states. 

The 1994 state NAEP assessment in reading brings to a close the series of TSAs 
authorized under the original 1988 legislation and subsequently extended through 
1994. With the conclusion of the TSAs comes also the conclusion of the evaluation of 
the program under the auspices of The National Academy of Education (NAE) Panel 







on the Evaluation of the NAEP Trial State Assessment. The purpose of this report is 
twofold: first, to present the Panel’s specific findings and recommendations stemming 
from its evaluation of the 1994 TSA in fourth-grade reading and second, to offer 
several general conclusions regarding the larger TSA program. In a subsequent 
capstone report, the Panel will build on these conclusions to provide analysis and 
recommendations for the entire NAEP program in the year 2000 and beyond. 



History of NAEP TSA Evaluations 



Over the past six years, the NAE Panel has seen the NAEP state assessment program 
grow, albeit somewhat erratically, and become a highly valued feature of the NAEP 
program. In the face of budgetary constraints, however, the final 1994 trial, like the 
first, included only one subject at one grade. The largest trial, in 1992, covered two 
subjects at grade 4 and one at grade 8. A trial at the 12th grade was never carried out. 
(See table 1.1.) 

Table 1.1. Grades and subjects assessed in the NAEP trial state 
assessments, 1990-1994 

             1990      1992              1994 

Grade 4      --        Reading, Math     Reading 

Grade 8      Math      Math              -- 


Participation by states and other jurisdictions has increased with each assessment 
cycle, reaching a new high of 46 participating jurisdictions with the just-completed 
1996 state assessment, which was also the first state assessment that was not labeled 
as a “trial.” Additionally, state trend results, which have been available from every 
assessment since 1992, have helped to sustain the attention of user groups, who 
continue to show strong interest in the results. 

Throughout this period, the NAE Panel has been involved in an ongoing evaluation of 
the TSA program. Three reports were submitted to Congress over the course of the 
first two TSAs. In them, the Panel concluded that the TSAs were successful and 
should be continued. Certain areas for improvement or further study were also noted, 
however, including areas in which the full consequences could not be ascertained 
before the trials were scheduled to end. Furthermore, the Panel noted in the last of 
these reports, The Trial State Assessment: Prospects and Realities, that “many of the 
factors affecting the quality and feasibility of state NAEP are the same as those 
affecting national NAEP.” 1 Accordingly, the Panel called for continuing evaluation of 
the full NAEP program, including the state assessment component. It also called for 
continuing research and development in the important area of performance standards. 



1 The National Academy of Education, The Trial State Assessment: Prospects and Realities (Stanford, CA: 
Author, 1993), 104. 







